26 research outputs found

    Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

    We introduce a model for bidirectional retrieval of images and sentences through a multi-modal embedding of visual and natural language data. Unlike previous models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. In addition to a ranking objective seen in previous work, this allows us to add a new fragment alignment objective that learns to directly associate these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments significantly improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions since the inferred inter-modal fragment alignment is explicit.
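    The abstract leaves both objectives implicit; as a rough illustration only, the sketch below shows one way a global ranking objective can sit on top of fragment-level scores, assuming (as a simplification, not the paper's exact formulation) that an image-sentence score is the mean, over sentence fragments, of the best-matching image-fragment dot product. All function names are hypothetical.

    import numpy as np

    def pair_score(img_frags, sent_frags):
        """Score an image-sentence pair from fragment embeddings.

        img_frags:  (n_objects, d) object fragment embeddings
        sent_frags: (n_relations, d) dependency-relation fragment embeddings
        Simplifying assumption: each sentence fragment is matched to its
        best-scoring image fragment and the pair score is the mean of those maxima.
        """
        sims = sent_frags @ img_frags.T      # fragment-pair dot products
        return sims.max(axis=1).mean()       # best image fragment per sentence fragment

    def ranking_loss(image_sets, sentence_sets, margin=1.0):
        """Hinge ranking loss: the correct pair (i, i) should outscore
        mismatched pairs (i, j) and (j, i) by `margin`."""
        n = len(image_sets)
        S = np.array([[pair_score(image_sets[i], sentence_sets[j])
                       for j in range(n)] for i in range(n)])
        loss = 0.0
        for i in range(n):
            for j in range(n):
                if i != j:
                    loss += max(0.0, margin - S[i, i] + S[i, j])  # rank sentences for image i
                    loss += max(0.0, margin - S[i, i] + S[j, i])  # rank images for sentence i
        return loss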

    Large-Scale Video Classification with Convolutional Neural Networks

    Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).
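    The multiresolution, foveated architecture processes each frame in two reduced streams: a downsampled view of the whole frame (context) and a centre crop at the original resolution (fovea). Below is a minimal sketch of that input split; the 178 -> 89 pixel sizes are assumptions used for illustration rather than dimensions quoted from the abstract.

    import numpy as np

    def foveated_streams(frame, crop=89):
        """Split a frame into the context and fovea streams of a foveated architecture.

        frame: (2*crop, 2*crop, 3) array. Each stream sees only a quarter of the
        pixels, which is what makes training cheaper.
        """
        h, w, _ = frame.shape
        context = frame[::2, ::2]                        # whole frame, 2x downsampled
        top, left = (h - crop) // 2, (w - crop) // 2
        fovea = frame[top:top + crop, left:left + crop]  # centre crop at full resolution
        return context, fovea

    # Example with a random 178x178 RGB frame
    ctx, fov = foveated_streams(np.random.randint(0, 256, (178, 178, 3), dtype=np.uint8))
    assert ctx.shape == fov.shape == (89, 89, 3)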

    ImageNet Large Scale Visual Recognition Challenge

    The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
    Comment: 43 pages, 16 figures. v3 includes additional comparisons with PASCAL VOC (per-category comparisons in Table 3, distribution of localization difficulty in Fig 16), a list of queries used for obtaining object detection images (Appendix C), and some additional references.
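    For readers unfamiliar with the benchmark, the classification track is conventionally scored with top-5 error; the sketch below shows that metric only as background context, not as code taken from the paper.

    import numpy as np

    def top5_error(scores, labels):
        """Top-5 classification error as used in the ILSVRC classification task.

        scores: (n_images, n_classes) class scores
        labels: (n_images,) ground-truth class indices
        An image counts as correct if its true label appears among the five
        highest-scoring classes.
        """
        top5 = np.argsort(-scores, axis=1)[:, :5]
        correct = (top5 == labels[:, None]).any(axis=1)
        return 1.0 - correct.mean()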

    Staged learning of agile motor skills

    Motor learning lies at the heart of how humans and animals acquire their skills. Understanding of this process enables many benefits in Robotics, physics-based Computer Animation, and other areas of science and engineering. In this thesis, we develop a computational framework for learning of agile, integrated motor skills. Our algorithm draws inspiration from the process by which humans and animals acquire their skills in nature. Specifically, all skills are learned through a process of staged, incremental learning, during which progressively more complex skills are acquired and subsequently integrated with prior abilities. Accordingly, our learning algorithm comprises three phases. In the first phase, a few seed motions that accomplish the goals of a skill are acquired. In the second phase, additional motions are collected through active exploration. Finally, the third phase generalizes from observations made in the second phase to yield a dynamics model that is relevant to the goals of a skill. We apply our learning algorithm to a simple, planar character in a physical simulation and learn a variety of integrated skills such as hopping, flipping, rolling, stopping, getting up and continuous acrobatic maneuvers. Aspects of each skill, such as the length, height and speed of the motion, can be interactively controlled through a user interface. Furthermore, we show that the algorithm can be used without modification to learn all skills for a whole family of parameterized characters of similar structure. Finally, we demonstrate that our approach also scales to a more complex quadruped character.
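    Read as pseudocode, the three phases compose as below. Every callable here is a hypothetical placeholder for the thesis's actual components, so this is only a control-flow sketch of the staging, not the method itself.

    def staged_skill_learning(acquire_seed_motions, explore, fit_dynamics_model,
                              exploration_rounds=10):
        """Control-flow sketch of the three-phase staged learning described above.

        acquire_seed_motions() -> a few motions that accomplish the skill's goals
        explore(motions)       -> additional motions gathered by active exploration
        fit_dynamics_model(ms) -> dynamics model generalising the observations
        All three arguments are hypothetical placeholders.
        """
        # Phase 1: seed motions that accomplish the goals of the skill
        motions = list(acquire_seed_motions())
        # Phase 2: collect additional motions through active exploration
        for _ in range(exploration_rounds):
            motions.extend(explore(motions))
        # Phase 3: generalise the observations into a goal-relevant dynamics model
        return fit_dynamics_model(motions)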

    Object Discovery in 3D scenes via Shape Analysis

    We present a method for discovering object models from 3D meshes of indoor environments. Our algorithm first decomposes the scene into a set of candidate mesh segments and then ranks each segment according to its "objectness", a quality that distinguishes objects from clutter. To do so, we propose five intrinsic shape measures: compactness, symmetry, smoothness, and local and global convexity. We additionally propose a recurrence measure, codifying the intuition that frequently occurring geometries are more likely to correspond to complete objects. We evaluate our method in both supervised and unsupervised regimes on a dataset of 58 indoor scenes collected using an Open Source implementation of Kinect Fusion [1]. We show that our approach can reliably and efficiently distinguish objects from clutter, with an Average Precision score of 0.92. We make our dataset available to the public.
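    Structurally, the unsupervised regime reduces to scoring each candidate segment on the six measures and sorting. The sketch below assumes a plain weighted sum as the combination rule (a supervised regime would instead learn that combination from labelled scenes), and the measure values themselves are taken as given rather than computed.

    import numpy as np

    # One column per measure; the measure definitions are not reproduced here.
    MEASURES = ["compactness", "symmetry", "smoothness",
                "local_convexity", "global_convexity", "recurrence"]

    def rank_segments(segment_features, weights=None):
        """Rank candidate mesh segments by a combined 'objectness' score.

        segment_features: (n_segments, len(MEASURES)) array of per-segment measures.
        weights: optional per-measure weights; an unweighted sum is the assumption
        used here for illustration.
        Returns segment indices sorted from most to least object-like.
        """
        feats = np.asarray(segment_features, dtype=float)
        w = np.ones(feats.shape[1]) if weights is None else np.asarray(weights, dtype=float)
        scores = feats @ w
        return np.argsort(-scores)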

    Deep fragment embeddings for bidirectional image sentence mapping

    We introduce a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data. Unlike previous models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. We then introduce a structured max-margin objective that allows our model to explicitly associate these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions for the image-sentence retrieval task since the inferred inter-modal alignment of fragments is explicit.
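    Unlike the global ranking term sketched under the first record of this paper above, the structured max-margin objective acts on individual fragment pairs. The sketch below is a deliberately simplified version that drops the latent alignment variables and only conveys the hinge-loss shape of such an objective; it is not the paper's exact formulation.

    import numpy as np

    def fragment_alignment_loss(img_frags, sent_frags, matched, margin=1.0):
        """Simplified fragment-level max-margin objective.

        img_frags:  (n_objects, d) image fragment embeddings
        sent_frags: (n_relations, d) sentence fragment embeddings
        matched:    True if the fragments come from a corresponding image-sentence pair
        Assumption: for a matched pair, each sentence fragment must align to at
        least one image fragment with score >= margin; for a mismatched pair,
        every fragment-pair score is pushed below -margin.
        """
        scores = sent_frags @ img_frags.T            # (n_relations, n_objects)
        if matched:
            best = scores.max(axis=1)                # best image fragment per sentence fragment
            return np.maximum(0.0, margin - best).sum()
        return np.maximum(0.0, margin + scores).sum()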